I am a computational biologist, an Asst. Professor in the department of Pathobiology & Diagnostic Investigation at Michigan State University. In our group, we develop computational approaches to understand infectious disease biology. Check out my webpage for more info. You can reach me here.
I’m also the founder & co-organizer of the R-Ladies East Lansing group on campus. We conduct R-related workshops & meetups regularly! So, do check out our upcoming events on Meetup.
You can access all relevant material pertaining to this workshop here. Other related workshops & useful cheatsheets.
Running RStudio locally? Download RStudio
Want to try the latest ‘Preview’ version of RStudio? RStudio Preview version
Trouble with local installation? Login & start using RStudio Cloud right away!
… if you haven’t already! The RStudio startup message should specify your current local version of R. For example, R v4.0.5
install.packages("tidyverse") # for data wrangling
install.packages("here") # to set paths relative to your current project.
# install.packages("gapminder") # sample dataset
Explore the RLEL workshops for more examples and sample code for Tidy Data and DataViz [e.g., >> presentations/20181105-workshop-tidydata uses the gapminder & USArrests datasets].
Trouble installing tidyverse?
- install.packages("PACKAGENAME")
- tidyverse suite of packages here

# If tidyverse installation fails, install individual constituent packages this way...
install.packages("readr") # Importing data files
install.packages("tidyr") # Tidy Data
install.packages("dplyr") # Data manipulation
install.packages("ggplot2") # Data Visualization (w/ Grammar of Graphics)
install.packages("readxl") # Importing excel files
library(tidyverse)
## ── Attaching packages ─────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.3 ✓ purrr 0.3.4
## ✓ tibble 3.1.0 ✓ dplyr 1.0.5
## ✓ tidyr 1.1.3 ✓ stringr 1.4.0
## ✓ readr 1.4.0 ✓ forcats 0.5.1
## ── Conflicts ────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
# OR load the individual packages, if you have trouble installing/loading `tidyverse`
# library(readr)
# library(readxl)
# library(tidyr)
# library(dplyr)
# library(ggplot2)
library(here) # https://github.com/jennybc/here_here
## here() starts at /Users/jananiravi/GitHub/tidyverse-genomics
# library(gapminder) # useful dataset for data wrangling, visualization
Cheatsheets @RStudio | Our cheatsheets repo
More resources towards the end of the document.
- read_csv, write_csv
- read_tsv, write_tsv
- read_delim, write_delim
- here::here

here::here() # where you opened your RStudio session/Project
## [1] "/Users/jananiravi/GitHub/tidyverse-genomics"
# Downloading sample genomics dataset from NCBI's FTP site
url <- "ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE69nnn/GSE69360/suppl/GSE69360_RNAseq.counts.txt.gz"
gse69360 <- read_tsv(url) # to read tab-delimited files
##
## ── Column specification ────────────────────────────────────
## cols(
## .default = col_double(),
## Geneid = col_character(),
## Chr = col_character(),
## Start = col_character(),
## End = col_character(),
## Strand = col_character()
## )
## ℹ Use `spec()` for the full column specifications.
?read_tsv # to check defaults
# Comma-separated values, as exported from Excel/spreadsheets
read_csv(file="path/to/my_data.csv", col_names=TRUE)
# Other atypical delimiters
read_delim(file="path/to/my_data.txt", col_names=TRUE, delim="//")
# Other useful packages
# readxl by Jenny Bryan (installed with tidyverse, but has to be loaded separately)
library(readxl)
read_excel(path="path/to/excel.xls",
           sheet=1,
           range="A1:D50",
           col_names=TRUE)
## Tip: Always open .Rproj and use relative paths with here()
## Example with here()
read_tsv(file=here("data/GSE69360_RNAseq.counts.txt"), col_names=TRUE)
Dataset details:
str(gse69360) # Structure of the dataframe
gse69360 # Data is in a cleaned-up 'tibble' format by default
head(gse69360) # Shows the top few observations (rows) of your data frame
glimpse(gse69360) # Info-dense summary of the data
View(head(gse69360, 100)) # View data in a visual GUI-based spreadsheet-like format
colnames(gse69360) # Column names
nrow(gse69360) # No. of rows
ncol(gse69360) # No. of columns
gse69360[1:5,7:10] # Subsetting a dataframe
## saving the data file
write_tsv(gse69360[1:100,7:12], "gse_subset.txt")
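To try the read/write functions without downloading anything, here is a minimal round-trip sketch. The toy table, its values, and the temporary file are made up for illustration; only `write_tsv()`, `read_tsv()` and `cols()` come from readr.

```r
library(readr)

# Toy table standing in for a counts file (values are made up)
toy <- data.frame(Geneid  = c("geneA", "geneB"),
                  Sample1 = c(1, 36),
                  Sample2 = c(0, 21))

tmp <- tempfile(fileext = ".tsv")  # throwaway path inside the session's temp folder
write_tsv(toy, tmp)                # write a tab-delimited file

# col_types = cols() lets readr guess the column types without printing the spec
toy_back <- read_tsv(tmp, col_types = cols())
toy_back
```

In a real project you would replace `tmp` with a path built by `here()`, so the script keeps working wherever the project folder lives.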
library(knitr)
kable(head(gse69360))
| Geneid | Chr | Start | End | Strand | Length | AA_Colon | AA_Heart | AA_Kidney | AA_Liver | AA_Lung | AA_Stomach | AF_Colon | AF_Stomach | BA_Colon | BA_Heart | BA_Kidney | BA_Liver | BA_Lung | BA_Stomach | BF_Colon | BF_Stomach | OA_Stomach1 | OA_Stomach2 | OA_Stomach3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ENSG00000223972.4 | chr1;chr1;chr1;chr1 | 11869;12595;12975;13221 | 12227;12721;13052;14412 | +;+;+;+ | 1756 | 1 | 0 | 0 | 0 | 0 | 0 | 7 | 2 | 0 | 0 | 0 | 0 | 7 | 0 | 0 | 3 | 3 | 1 | 0 |
| ENSG00000227232.4 | chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1;chr1 | 14363;14970;15796;16607;16854;17233;17498;17602;17915;18268;24734;29321;29534 | 14829;15038;15947;16765;17055;17368;17504;17742;18061;18379;24891;29370;29806 | -;-;-;-;-;-;-;-;-;-;-;-;- | 2073 | 36 | 21 | 32 | 28 | 17 | 17 | 102 | 41 | 10 | 24 | 22 | 6 | 49 | 2 | 153 | 157 | 38 | 53 | 18 |
| ENSG00000243485.2 | chr1;chr1;chr1 | 29554;30267;30976 | 30039;30667;31109 | +;+;+ | 1021 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ENSG00000237613.2 | chr1;chr1;chr1 | 34554;35245;35721 | 35174;35481;36081 | -;-;- | 1219 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ENSG00000268020.2 | chr1;chr1 | 52473;54830 | 53312;54936 | +;+ | 947 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 |
| ENSG00000240361.1 | chr1 | 62948 | 63887 | + | 940 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
library(rmarkdown)
paged_table(gse69360)
## legacy tidyverse functions
gather() # Gather COLUMNS -> ROWS
spread() # Spread ROWS -> COLUMNS
## newer tidyverse functions
pivot_longer() # Pivot data from wide to long
pivot_wider() # Pivot data from long to wide
separate() # Separate 1 COLUMN -> many COLUMNS
unite() # Unite several COLUMNS -> 1 COLUMN
- gather: Gather columns into key-value pairs. Wide -> Long
- spread: Spread a key-value pair across multiple columns: Long -> Wide

# Gather all columns except 'Geneid'
gse69360 %>%
select(Geneid, matches("[AF]_")) %>%
# gather(-Geneid, key="Sample", value="Counts") # wide -> long format ## legacy
pivot_longer(cols=matches("[AF]_"), names_to="Sample", values_to="Counts")
## # A tibble: 1,098,580 x 3
## Geneid Sample Counts
## <chr> <chr> <dbl>
## 1 ENSG00000223972.4 AA_Colon 1
## 2 ENSG00000223972.4 AA_Heart 0
## 3 ENSG00000223972.4 AA_Kidney 0
## 4 ENSG00000223972.4 AA_Liver 0
## 5 ENSG00000223972.4 AA_Lung 0
## 6 ENSG00000223972.4 AA_Stomach 0
## 7 ENSG00000223972.4 AF_Colon 7
## 8 ENSG00000223972.4 AF_Stomach 2
## 9 ENSG00000223972.4 BA_Colon 0
## 10 ENSG00000223972.4 BA_Heart 0
## # … with 1,098,570 more rows
# Gather, then Spread --> Back to original data
gse69360 %>%
select(Geneid, matches("[AF]_")) %>%
# gather(-Geneid, key="Sample", value="Counts") %>%
# spread(key = "Sample", value = "Counts") # spread to turn back to the original data!
pivot_longer(cols=matches("[AF]_"), names_to="Sample", values_to="Counts") %>%
pivot_wider(names_from="Sample", values_from="Counts")
## # A tibble: 57,820 x 20
## Geneid AA_Colon AA_Heart AA_Kidney AA_Liver AA_Lung
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ENSG0000022… 1 0 0 0 0
## 2 ENSG0000022… 36 21 32 28 17
## 3 ENSG0000024… 0 0 0 0 0
## 4 ENSG0000023… 0 0 0 0 0
## 5 ENSG0000026… 0 0 0 0 0
## 6 ENSG0000024… 0 0 0 0 0
## 7 ENSG0000018… 0 0 0 0 0
## 8 ENSG0000023… 3 0 1 2 1
## 9 ENSG0000023… 0 0 0 0 0
## 10 ENSG0000023… 0 0 0 0 0
## # … with 57,810 more rows, and 14 more variables:
## # AA_Stomach <dbl>, AF_Colon <dbl>, AF_Stomach <dbl>,
## # BA_Colon <dbl>, BA_Heart <dbl>, BA_Kidney <dbl>,
## # BA_Liver <dbl>, BA_Lung <dbl>, BA_Stomach <dbl>,
## # BF_Colon <dbl>, BF_Stomach <dbl>, OA_Stomach1 <dbl>,
## # OA_Stomach2 <dbl>, OA_Stomach3 <dbl>
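The same wide -> long -> wide round trip is easier to follow on a table small enough to read in full. This is only a sketch: the two genes and their counts are made up, and the column names just mimic the sample names used above.

```r
library(dplyr)
library(tidyr)

# Toy 'wide' table: one row per gene, one column per sample (made-up values)
wide <- tibble(Geneid   = c("geneA", "geneB"),
               AA_Colon = c(1, 36),
               AF_Colon = c(7, 102))

# Wide -> long: one row per gene-sample pair
long <- wide %>%
  pivot_longer(cols = -Geneid, names_to = "Sample", values_to = "Counts")
long

# Long -> wide: back to one column per sample
wide_again <- long %>%
  pivot_wider(names_from = "Sample", values_from = "Counts")

# Compare as plain data frames
all.equal(as.data.frame(wide), as.data.frame(wide_again))
```

Two genes times two samples gives four rows in `long`, which is the same arithmetic behind the 1,098,580 rows above (57,820 genes x 19 samples).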
- unite: Unite multiple columns into one
- separate: Separate one column into multiple columns

gse69360 %>%
select(Geneid, matches("[AF]_")) %>% # selecting only Counts columns
pivot_longer(cols=matches("[AF]_"), names_to="Sample", values_to="Counts") %>% # wide -> long
separate(Sample, into=c("Source_Stage", "Tissue"), sep="_") # separate logically
## # A tibble: 1,098,580 x 4
## Geneid Source_Stage Tissue Counts
## <chr> <chr> <chr> <dbl>
## 1 ENSG00000223972.4 AA Colon 1
## 2 ENSG00000223972.4 AA Heart 0
## 3 ENSG00000223972.4 AA Kidney 0
## 4 ENSG00000223972.4 AA Liver 0
## 5 ENSG00000223972.4 AA Lung 0
## 6 ENSG00000223972.4 AA Stomach 0
## 7 ENSG00000223972.4 AF Colon 7
## 8 ENSG00000223972.4 AF Stomach 2
## 9 ENSG00000223972.4 BA Colon 0
## 10 ENSG00000223972.4 BA Heart 0
## # … with 1,098,570 more rows
gse69360 %>%
select(Geneid, matches("[AF]_")) %>%
pivot_longer(cols=matches("[AF]_"), names_to="Sample", values_to="Counts") %>%
separate(Sample, into=c("Source_Stage", "Tissue"), sep="_") %>%
separate(Source_Stage, into=c("Source", "Stage"), sep=1) # separate by char position
## # A tibble: 1,098,580 x 5
## Geneid Source Stage Tissue Counts
## <chr> <chr> <chr> <chr> <dbl>
## 1 ENSG00000223972.4 A A Colon 1
## 2 ENSG00000223972.4 A A Heart 0
## 3 ENSG00000223972.4 A A Kidney 0
## 4 ENSG00000223972.4 A A Liver 0
## 5 ENSG00000223972.4 A A Lung 0
## 6 ENSG00000223972.4 A A Stomach 0
## 7 ENSG00000223972.4 A F Colon 7
## 8 ENSG00000223972.4 A F Stomach 2
## 9 ENSG00000223972.4 B A Colon 0
## 10 ENSG00000223972.4 B A Heart 0
## # … with 1,098,570 more rows
gse69360 %>%
select(Geneid, matches("[AF]_")) %>%
pivot_longer(cols=matches("[AF]_"), names_to="Sample", values_to="Counts") %>%
separate(Sample, into=c("Source_Stage", "Tissue"), sep="_") %>%
separate(Source_Stage, into=c("Source", "Stage"), sep=1) %>%
unite(Stage_Tissue, Stage, Tissue) # combining a different set of columns
## # A tibble: 1,098,580 x 4
## Geneid Source Stage_Tissue Counts
## <chr> <chr> <chr> <dbl>
## 1 ENSG00000223972.4 A A_Colon 1
## 2 ENSG00000223972.4 A A_Heart 0
## 3 ENSG00000223972.4 A A_Kidney 0
## 4 ENSG00000223972.4 A A_Liver 0
## 5 ENSG00000223972.4 A A_Lung 0
## 6 ENSG00000223972.4 A A_Stomach 0
## 7 ENSG00000223972.4 A F_Colon 7
## 8 ENSG00000223972.4 A F_Stomach 2
## 9 ENSG00000223972.4 B A_Colon 0
## 10 ENSG00000223972.4 B A_Heart 0
## # … with 1,098,570 more rows
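To see what the two kinds of `sep` do, here is a small sketch on two made-up sample names: a character `sep` splits on that delimiter, a numeric `sep` splits at a character position, and `unite()` pastes columns back together with `_` by default.

```r
library(dplyr)
library(tidyr)

samples <- tibble(Sample = c("AA_Colon", "BF_Stomach"))  # made-up sample names

parts <- samples %>%
  separate(Sample, into = c("Source_Stage", "Tissue"), sep = "_") %>%  # split on the delimiter
  separate(Source_Stage, into = c("Source", "Stage"), sep = 1)         # split after the 1st character
parts

# unite() is not an exact inverse here: it inserts "_" between every pair of columns
rejoined <- parts %>%
  unite(Sample, Source, Stage, Tissue, sep = "_")
rejoined$Sample   # "A_A_Colon" "B_F_Stomach"
```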
conflicted::conflict_prefer(name="filter", winner="dplyr")
filter() # PICK observations by their values | ROWS
select() # PICK variables by their names | COLUMNS
mutate() # CREATE new variables w/ functions of existing variables | COLUMNS
transmute() # COMPUTE 1 or more COLUMNS but drop original columns
arrange() # REORDER the ROWS
summarize() # COLLAPSE many values to a single SUMMARY
group_by() # GROUP data into rows with the same value of variable (COLUMN)
- filter: Return rows with matching conditions

head(gse69360) # Snapshot of the dataframe
## # A tibble: 6 x 25
## Geneid Chr Start End Strand Length AA_Colon AA_Heart
## <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 ENSG00… chr1;… 1186… 1222… +;+;+… 1756 1 0
## 2 ENSG00… chr1;… 1436… 1482… -;-;-… 2073 36 21
## 3 ENSG00… chr1;… 2955… 3003… +;+;+ 1021 0 0
## 4 ENSG00… chr1;… 3455… 3517… -;-;- 1219 0 0
## 5 ENSG00… chr1;… 5247… 5331… +;+ 947 0 0
## 6 ENSG00… chr1 62948 63887 + 940 0 0
## # … with 17 more variables: AA_Kidney <dbl>,
## # AA_Liver <dbl>, AA_Lung <dbl>, AA_Stomach <dbl>,
## # AF_Colon <dbl>, AF_Stomach <dbl>, BA_Colon <dbl>,
## # BA_Heart <dbl>, BA_Kidney <dbl>, BA_Liver <dbl>,
## # BA_Lung <dbl>, BA_Stomach <dbl>, BF_Colon <dbl>,
## # BF_Stomach <dbl>, OA_Stomach1 <dbl>, OA_Stomach2 <dbl>,
## # OA_Stomach3 <dbl>
# Now, filter by condition
filter(gse69360, Length<=50)
## # A tibble: 124 x 25
## Geneid Chr Start End Strand Length AA_Colon AA_Heart
## <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 ENSG00… chr1… 3829… 3829… -;- 45 1 0
## 2 ENSG00… chr1 9521… 9521… - 41 7 1
## 3 ENSG00… chr1 1477… 1477… + 35 0 1
## 4 ENSG00… chr1… 1545… 1545… +;+ 21 0 0
## 5 ENSG00… chr1 2360… 2360… - 50 1 2
## 6 ENSG00… chr2… 1160… 1160… +;+ 36 1 0
## 7 ENSG00… chr2 8916… 8916… - 38 183 1
## 8 ENSG00… chr2 8916… 8916… - 37 98 2
## 9 ENSG00… chr2 8916… 8916… - 38 80 0
## 10 ENSG00… chr2 8916… 8916… - 38 0 0
## # … with 114 more rows, and 17 more variables:
## # AA_Kidney <dbl>, AA_Liver <dbl>, AA_Lung <dbl>,
## # AA_Stomach <dbl>, AF_Colon <dbl>, AF_Stomach <dbl>,
## # BA_Colon <dbl>, BA_Heart <dbl>, BA_Kidney <dbl>,
## # BA_Liver <dbl>, BA_Lung <dbl>, BA_Stomach <dbl>,
## # BF_Colon <dbl>, BF_Stomach <dbl>, OA_Stomach1 <dbl>,
## # OA_Stomach2 <dbl>, OA_Stomach3 <dbl>
# Can be rewritten using "Piping" %>%
gse69360 %>% # Pipe ('then') operator to serially connect operations
filter(Length <= 50)
## # A tibble: 124 x 25
## Geneid Chr Start End Strand Length AA_Colon AA_Heart
## <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 ENSG00… chr1… 3829… 3829… -;- 45 1 0
## 2 ENSG00… chr1 9521… 9521… - 41 7 1
## 3 ENSG00… chr1 1477… 1477… + 35 0 1
## 4 ENSG00… chr1… 1545… 1545… +;+ 21 0 0
## 5 ENSG00… chr1 2360… 2360… - 50 1 2
## 6 ENSG00… chr2… 1160… 1160… +;+ 36 1 0
## 7 ENSG00… chr2 8916… 8916… - 38 183 1
## 8 ENSG00… chr2 8916… 8916… - 37 98 2
## 9 ENSG00… chr2 8916… 8916… - 38 80 0
## 10 ENSG00… chr2 8916… 8916… - 38 0 0
## # … with 114 more rows, and 17 more variables:
## # AA_Kidney <dbl>, AA_Liver <dbl>, AA_Lung <dbl>,
## # AA_Stomach <dbl>, AF_Colon <dbl>, AF_Stomach <dbl>,
## # BA_Colon <dbl>, BA_Heart <dbl>, BA_Kidney <dbl>,
## # BA_Liver <dbl>, BA_Lung <dbl>, BA_Stomach <dbl>,
## # BF_Colon <dbl>, BF_Stomach <dbl>, OA_Stomach1 <dbl>,
## # OA_Stomach2 <dbl>, OA_Stomach3 <dbl>
# Filtering using regex/substring match
gse69360 %>%
filter(grepl("chrY", Chr))
## # A tibble: 542 x 25
## Geneid Chr Start End Strand Length AA_Colon AA_Heart
## <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 ENSGR0… chrY… 1204… 1205… +;+ 259 0 0
## 2 ENSGR0… chrY… 1429… 1430… +;+;+… 6191 0 0
## 3 ENSGR0… chrY… 1700… 1701… -;-;-… 2363 0 0
## 4 ENSGR0… chrY… 2317… 2319… +;+ 428 0 0
## 5 ENSGR0… chrY… 2446… 2496… -;-;-… 9015 0 0
## 6 ENSGR0… chrY 3753… 3754… - 101 0 0
## 7 ENSGR0… chrY 4345… 4348… - 328 0 0
## 8 ENSGR0… chrY 4559… 4560… - 117 0 0
## 9 ENSGR0… chrY… 5350… 5353… +;+;+… 4384 0 0
## 10 ENSGR0… chrY… 9009… 9011… +;+ 588 0 0
## # … with 532 more rows, and 17 more variables:
## # AA_Kidney <dbl>, AA_Liver <dbl>, AA_Lung <dbl>,
## # AA_Stomach <dbl>, AF_Colon <dbl>, AF_Stomach <dbl>,
## # BA_Colon <dbl>, BA_Heart <dbl>, BA_Kidney <dbl>,
## # BA_Liver <dbl>, BA_Lung <dbl>, BA_Stomach <dbl>,
## # BF_Colon <dbl>, BF_Stomach <dbl>, OA_Stomach1 <dbl>,
## # OA_Stomach2 <dbl>, OA_Stomach3 <dbl>
# Two filters at a time
gse69360 %>%
filter(Length <= 50 & grepl("chrY", Chr))
## # A tibble: 1 x 25
## Geneid Chr Start End Strand Length AA_Colon AA_Heart
## <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 ENSG00… chrY;… 5306… 5306… -;- 42 1 0
## # … with 17 more variables: AA_Kidney <dbl>,
## # AA_Liver <dbl>, AA_Lung <dbl>, AA_Stomach <dbl>,
## # AF_Colon <dbl>, AF_Stomach <dbl>, BA_Colon <dbl>,
## # BA_Heart <dbl>, BA_Kidney <dbl>, BA_Liver <dbl>,
## # BA_Lung <dbl>, BA_Stomach <dbl>, BF_Colon <dbl>,
## # BF_Stomach <dbl>, OA_Stomach1 <dbl>, OA_Stomach2 <dbl>,
## # OA_Stomach3 <dbl>
- select: Select/rename variables/columns by name

# Selecting columns that match a pattern
gse69360 %>%
select(Geneid, matches(".F_"))
## # A tibble: 57,820 x 5
## Geneid AF_Colon AF_Stomach BF_Colon BF_Stomach
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 ENSG00000223972.4 7 2 0 3
## 2 ENSG00000227232.4 102 41 153 157
## 3 ENSG00000243485.2 0 0 0 0
## 4 ENSG00000237613.2 0 0 0 0
## 5 ENSG00000268020.2 0 0 0 0
## 6 ENSG00000240361.1 0 0 0 0
## 7 ENSG00000186092.4 0 0 0 0
## 8 ENSG00000238009.2 4 0 5 4
## 9 ENSG00000239945.1 0 0 0 0
## 10 ENSG00000233750.3 2 3 3 1
## # … with 57,810 more rows
# Excluding specific columns
gse69360 %>%
select(-Chr, -Start, -End, -Strand, -Length)
## # A tibble: 57,820 x 20
## Geneid AA_Colon AA_Heart AA_Kidney AA_Liver AA_Lung
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ENSG0000022… 1 0 0 0 0
## 2 ENSG0000022… 36 21 32 28 17
## 3 ENSG0000024… 0 0 0 0 0
## 4 ENSG0000023… 0 0 0 0 0
## 5 ENSG0000026… 0 0 0 0 0
## 6 ENSG0000024… 0 0 0 0 0
## 7 ENSG0000018… 0 0 0 0 0
## 8 ENSG0000023… 3 0 1 2 1
## 9 ENSG0000023… 0 0 0 0 0
## 10 ENSG0000023… 0 0 0 0 0
## # … with 57,810 more rows, and 14 more variables:
## # AA_Stomach <dbl>, AF_Colon <dbl>, AF_Stomach <dbl>,
## # BA_Colon <dbl>, BA_Heart <dbl>, BA_Kidney <dbl>,
## # BA_Liver <dbl>, BA_Lung <dbl>, BA_Stomach <dbl>,
## # BF_Colon <dbl>, BF_Stomach <dbl>, OA_Stomach1 <dbl>,
## # OA_Stomach2 <dbl>, OA_Stomach3 <dbl>
# Excluding columns matching a pattern
gse69360 %>%
select(-matches("[AF]_"))
## # A tibble: 57,820 x 6
## Geneid Chr Start End Strand Length
## <chr> <chr> <chr> <chr> <chr> <dbl>
## 1 ENSG000… chr1;chr1;… 11869;125… 12227;127… +;+;+;+ 1756
## 2 ENSG000… chr1;chr1;… 14363;149… 14829;150… -;-;-;… 2073
## 3 ENSG000… chr1;chr1;… 29554;302… 30039;306… +;+;+ 1021
## 4 ENSG000… chr1;chr1;… 34554;352… 35174;354… -;-;- 1219
## 5 ENSG000… chr1;chr1 52473;548… 53312;549… +;+ 947
## 6 ENSG000… chr1 62948 63887 + 940
## 7 ENSG000… chr1 69091 70008 + 918
## 8 ENSG000… chr1;chr1;… 89295;920… 91629;922… -;-;-;… 3569
## 9 ENSG000… chr1;chr1 89551;902… 90050;911… -;- 1319
## 10 ENSG000… chr1 131025 134836 + 3812
## # … with 57,810 more rows
# Select then Filter
gse69360 %>%
select(Geneid, Chr, Length, matches("[AF]_")) %>%
filter(grepl("chrY", Chr) | Length <= 100)
## # A tibble: 4,671 x 22
## Geneid Chr Length AA_Colon AA_Heart AA_Kidney AA_Liver
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ENSG00… chr1… 90 0 0 0 0
## 2 ENSG00… chr1… 57 0 0 0 0
## 3 ENSG00… chr1 95 5 0 2 0
## 4 ENSG00… chr1 90 0 0 0 0
## 5 ENSG00… chr1 83 7 0 1 2
## 6 ENSG00… chr1 96 0 0 0 0
## 7 ENSG00… chr1 70 0 0 0 0
## 8 ENSG00… chr1 73 0 0 0 0
## 9 ENSG00… chr1 89 0 0 0 0
## 10 ENSG00… chr1 70 0 0 0 0
## # … with 4,661 more rows, and 15 more variables:
## # AA_Lung <dbl>, AA_Stomach <dbl>, AF_Colon <dbl>,
## # AF_Stomach <dbl>, BA_Colon <dbl>, BA_Heart <dbl>,
## # BA_Kidney <dbl>, BA_Liver <dbl>, BA_Lung <dbl>,
## # BA_Stomach <dbl>, BF_Colon <dbl>, BF_Stomach <dbl>,
## # OA_Stomach1 <dbl>, OA_Stomach2 <dbl>, OA_Stomach3 <dbl>
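Note that the "Two filters at a time" example combines its conditions with `&` (both must hold), while the "Select then Filter" example uses `|` (either one is enough), which is why the row counts differ so much. A toy sketch with four made-up genes shows the difference:

```r
library(dplyr)

# Made-up genes: g1 is short, g2 is short and on chrY, g3 is on chrY only, g4 is neither
genes <- tibble(Geneid = c("g1", "g2", "g3", "g4"),
                Chr    = c("chr1", "chrY", "chrY;chrY", "chr2"),
                Length = c(40, 45, 900, 1200))

both   <- genes %>% filter(Length <= 50 & grepl("chrY", Chr))  # keeps g2
either <- genes %>% filter(Length <= 50 | grepl("chrY", Chr))  # keeps g1, g2, g3

both$Geneid
either$Geneid
```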
- mutate: Adds new variables; keeps existing variables
- transmute: Adds new variables; drops existing variables

# Excluding columns matching a condition
gse69360 %>%
select(-matches("[AF]_")) %>%
head(., 10) %>% View()
# Storing gene location information in a separate data frame
gene_loc <- gse69360 %>% # saving output to a variable
  select(-matches("[AF]_")) %>% # select columns
  mutate(Geneid = gsub("\\.[0-9]*$", "", Geneid)) %>% # remove the version suffix (e.g., ".4")
  mutate(Chr = gsub(";.*$", "", Chr)) %>% # keep the first element for Chr
  mutate(Start = as.numeric(gsub(";.*$", "", Start))) %>% # same for Start
  mutate(End = as.numeric(gsub(";.*$", "", End))) %>% # same for End
  mutate(Strand = gsub(";.*$", "", Strand)) # same for Strand
# Check to see if you have what you expected!
gene_loc
## # A tibble: 57,820 x 6
## Geneid Chr Start End Strand Length
## <chr> <chr> <dbl> <dbl> <chr> <dbl>
## 1 ENSG00000223972 chr1 11869 12227 + 1756
## 2 ENSG00000227232 chr1 14363 14829 - 2073
## 3 ENSG00000243485 chr1 29554 30039 + 1021
## 4 ENSG00000237613 chr1 34554 35174 - 1219
## 5 ENSG00000268020 chr1 52473 53312 + 947
## 6 ENSG00000240361 chr1 62948 63887 + 940
## 7 ENSG00000186092 chr1 69091 70008 + 918
## 8 ENSG00000238009 chr1 89295 91629 - 3569
## 9 ENSG00000239945 chr1 89551 90050 - 1319
## 10 ENSG00000233750 chr1 131025 134836 + 3812
## # … with 57,810 more rows
View(head(gene_loc, 10))
# Creating new variables
gene_loc %>%
mutate(kbStart = Start/1000, # creates new variables/columns
kbEnd = End/1000,
kbLength = Length/1000)
## # A tibble: 57,820 x 9
## Geneid Chr Start End Strand Length kbStart kbEnd
## <chr> <chr> <dbl> <dbl> <chr> <dbl> <dbl> <dbl>
## 1 ENSG0000… chr1 11869 12227 + 1756 11.9 12.2
## 2 ENSG0000… chr1 14363 14829 - 2073 14.4 14.8
## 3 ENSG0000… chr1 29554 30039 + 1021 29.6 30.0
## 4 ENSG0000… chr1 34554 35174 - 1219 34.6 35.2
## 5 ENSG0000… chr1 52473 53312 + 947 52.5 53.3
## 6 ENSG0000… chr1 62948 63887 + 940 62.9 63.9
## 7 ENSG0000… chr1 69091 70008 + 918 69.1 70.0
## 8 ENSG0000… chr1 89295 91629 - 3569 89.3 91.6
## 9 ENSG0000… chr1 89551 90050 - 1319 89.6 90.0
## 10 ENSG0000… chr1 131025 134836 + 3812 131. 135.
## # … with 57,810 more rows, and 1 more variable:
## # kbLength <dbl>
# Creating new variables & dropping old ones
gene_loc %>%
transmute(kbStart = Start/1000, # drops original columns
kbEnd = End/1000,
kbLength = Length/1000)
## # A tibble: 57,820 x 3
## kbStart kbEnd kbLength
## <dbl> <dbl> <dbl>
## 1 11.9 12.2 1.76
## 2 14.4 14.8 2.07
## 3 29.6 30.0 1.02
## 4 34.6 35.2 1.22
## 5 52.5 53.3 0.947
## 6 62.9 63.9 0.94
## 7 69.1 70.0 0.918
## 8 89.3 91.6 3.57
## 9 89.6 90.0 1.32
## 10 131. 135. 3.81
## # … with 57,810 more rows
- distinct: Pick unique entries
- arrange: Arrange rows by variables

# Pick only the unique entries in a column
gene_loc %>%
distinct(Chr)
## # A tibble: 25 x 1
## Chr
## <chr>
## 1 chr1
## 2 chr2
## 3 chr3
## 4 chr4
## 5 chr5
## 6 chr6
## 7 chr7
## 8 chr8
## 9 chr9
## 10 chr10
## # … with 15 more rows
gene_loc %>%
distinct(Strand)
## # A tibble: 2 x 1
## Strand
## <chr>
## 1 +
## 2 -
# Pick unique combinations
gene_loc %>%
distinct(Chr, Strand)
## # A tibble: 50 x 2
## Chr Strand
## <chr> <chr>
## 1 chr1 +
## 2 chr1 -
## 3 chr2 -
## 4 chr2 +
## 5 chr3 +
## 6 chr3 -
## 7 chr4 -
## 8 chr4 +
## 9 chr5 +
## 10 chr5 -
## # … with 40 more rows
# Then sort aka arrange your data
gene_loc %>%
arrange(desc(Chr)) # sort in descending order
## # A tibble: 57,820 x 6
## Geneid Chr Start End Strand Length
## <chr> <chr> <dbl> <dbl> <chr> <dbl>
## 1 ENSGR0000228572 chrY 120410 120513 + 259
## 2 ENSGR0000182378 chrY 142989 143061 + 6191
## 3 ENSGR0000178605 chrY 170025 170137 - 2363
## 4 ENSGR0000226179 chrY 231725 231983 + 428
## 5 ENSGR0000167393 chrY 244698 249631 - 9015
## 6 ENSGR0000266731 chrY 375316 375416 - 101
## 7 ENSGR0000234958 chrY 434510 434837 - 328
## 8 ENSGR0000229232 chrY 455971 456087 - 117
## 9 ENSGR0000185960 chrY 535079 535337 + 4384
## 10 ENSGR0000237531 chrY 900956 901162 + 588
## # … with 57,810 more rows
gene_loc %>%
arrange(Chr, Length) # sort by Chr, then Length
## # A tibble: 57,820 x 6
## Geneid Chr Start End Strand Length
## <chr> <chr> <dbl> <dbl> <chr> <dbl>
## 1 ENSG00000268141 chr1 154585067 154585080 + 21
## 2 ENSG00000224335 chr1 147706573 147706607 + 35
## 3 ENSG00000263526 chr1 95211416 95211456 - 41
## 4 ENSG00000230610 chr1 38292232 38292262 - 45
## 5 ENSG00000252056 chr1 236026688 236026737 - 50
## 6 ENSG00000238705 chr1 26881033 26881084 + 52
## 7 ENSG00000264650 chr1 45499641 45499692 - 52
## 8 ENSG00000265769 chr1 52366592 52366643 - 52
## 9 ENSG00000251795 chr1 93303575 93303627 + 53
## 10 ENSG00000238310 chr1 109750416 109750470 - 55
## # … with 57,810 more rows
# arrange(Chr, -Length) # to reverse sort by 'numeric' Length
- summarize: Reduces multiple values down to a single value
- group_by: Combine entries by one or more variables

# Combine by a variable, then calculate summary statistics for each group
gene_loc %>%
group_by(Chr) %>% # combine rows by Chr
summarize(numGenes = n(), # then summarise, number of genes/Chr
startoffirstGene = min(Start)) # min to get the first Start location
## # A tibble: 25 x 3
## Chr numGenes startoffirstGene
## <chr> <int> <dbl>
## 1 chr1 5363 11869
## 2 chr10 2260 90652
## 3 chr11 3208 75780
## 4 chr12 2818 67607
## 5 chr13 1217 19041312
## 6 chr14 2244 19110203
## 7 chr15 2080 20083769
## 8 chr16 2343 61553
## 9 chr17 2903 4961
## 10 chr18 1127 11103
## # … with 15 more rows
# Example to show you can use all math/stat functions to summarize data groups
gene_loc %>%
arrange(Length) %>%
group_by(Chr, Strand) %>%
summarize(numGenes = n(),
smallestGene = first(Geneid),
minLength = min(Length),
firstqLength = quantile(Length, 0.25),
medianLength = median(Length),
iqrLength = IQR(Length),
thirdqLength = quantile(Length, 0.75),
maxLength = max(Length),
longestGene = last(Geneid))
## `summarise()` has grouped output by 'Chr'. You can override using the `.groups` argument.
## # A tibble: 50 x 11
## # Groups: Chr [25]
## Chr Strand numGenes smallestGene minLength firstqLength
## <chr> <chr> <int> <chr> <dbl> <dbl>
## 1 chr1 - 2651 ENSG0000026… 41 354.
## 2 chr1 + 2712 ENSG0000026… 21 394.
## 3 chr10 - 1109 ENSG0000026… 52 322
## 4 chr10 + 1151 ENSG0000027… 27 356
## 5 chr11 - 1605 ENSG0000026… 30 477
## 6 chr11 + 1603 ENSG0000025… 37 476
## 7 chr12 - 1415 ENSG0000025… 28 375
## 8 chr12 + 1403 ENSG0000026… 30 402
## 9 chr13 - 631 ENSG0000021… 48 297
## 10 chr13 + 586 ENSG0000026… 57 371.
## # … with 40 more rows, and 5 more variables:
## # medianLength <dbl>, iqrLength <dbl>,
## # thirdqLength <dbl>, maxLength <dbl>, longestGene <chr>
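The message above about grouped output refers to the `.groups` argument of `summarize()`. As a sketch on a made-up four-gene table, `.groups = "drop"` returns an ungrouped tibble and silences the message:

```r
library(dplyr)

# Made-up gene table: three genes on chr1 (two on '+', one on '-') and one on chr2
genes <- tibble(Chr    = c("chr1", "chr1", "chr1", "chr2"),
                Strand = c("+", "+", "-", "+"),
                Length = c(100, 300, 50, 700))

per_group <- genes %>%
  group_by(Chr, Strand) %>%              # one group per Chr-Strand combination
  summarize(numGenes     = n(),          # rows in each group
            medianLength = median(Length),
            .groups = "drop")            # drop the grouping from the result
per_group
```

Without `.groups`, the result would still be grouped by `Chr`, so any later `mutate()` or `summarize()` would silently work within each chromosome.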
# Renaming chromosomes & declaring them as factors w/ an intrinsic order
# gene_loc$Chr <- factor(gene_loc$Chr,
# levels = paste("chr",
# c((1:22), "X", "Y", "M"),
# sep=""))
# Saving your data locally
gene_loc %>%
write_tsv(here("data/GSE69360.gene-locations.txt"))
Let’s combine everything from above to tidy the full GSE69360 dataset.
Dataset details:
# Extracting just the expression values & cleaning it up
View(head(gse69360, 50))
gene_counts <- gse69360 %>%
  select(-Chr, -Start, -End, -Strand, -Length) %>% # another way to select just the expression data
  rename(OA_Stomach = OA_Stomach1) %>% # rename a column
  mutate(OA_Stomach2 = NULL, OA_Stomach3 = NULL) %>% # remove a couple of columns
  mutate(Geneid = gsub("\\.[0-9]*$", "", Geneid)) # strip the version suffix from the gene IDs
logcpm <- gene_counts %>%
  select(-Geneid) %>%
  mutate_all(function(x) { log2((x*(1e+6)/sum(x)) + 1) } ) # convert counts in each sample to log2(counts-per-million + 1)
summary(logcpm)
## AA_Colon AA_Heart AA_Kidney
## Min. : 0.00000 Min. : 0.0000 Min. : 0.0000
## 1st Qu.: 0.00000 1st Qu.: 0.0000 1st Qu.: 0.0000
## Median : 0.06883 Median : 0.0000 Median : 0.0000
## Mean : 1.10187 Mean : 0.8103 Mean : 0.8785
## 3rd Qu.: 1.33045 3rd Qu.: 0.5151 3rd Qu.: 0.7181
## Max. :18.41311 Max. :18.7478 Max. :18.8061
## AA_Liver AA_Lung AA_Stomach
## Min. : 0.00000 Min. : 0.000 Min. : 0.00000
## 1st Qu.: 0.00000 1st Qu.: 0.000 1st Qu.: 0.00000
## Median : 0.05563 Median : 0.000 Median : 0.06964
## Mean : 0.98788 Mean : 1.049 Mean : 0.93120
## 3rd Qu.: 0.95867 3rd Qu.: 1.123 3rd Qu.: 0.91851
## Max. :17.85244 Max. :17.722 Max. :17.85986
## AF_Colon AF_Stomach BA_Colon
## Min. : 0.0000 Min. : 0.0000 Min. : 0.00000
## 1st Qu.: 0.0000 1st Qu.: 0.0000 1st Qu.: 0.00000
## Median : 0.1995 Median : 0.0000 Median : 0.08114
## Mean : 1.2903 Mean : 0.7748 Mean : 0.98586
## 3rd Qu.: 1.8951 3rd Qu.: 0.8105 3rd Qu.: 1.10906
## Max. :17.4620 Max. :18.3586 Max. :18.75742
## BA_Heart BA_Kidney BA_Liver
## Min. : 0.00000 Min. : 0.00000 Min. : 0.0000
## 1st Qu.: 0.00000 1st Qu.: 0.00000 1st Qu.: 0.0000
## Median : 0.04537 Median : 0.05856 Median : 0.0000
## Mean : 0.80833 Mean : 0.96659 Mean : 0.8194
## 3rd Qu.: 0.68444 3rd Qu.: 1.11092 3rd Qu.: 0.5974
## Max. :18.92846 Max. :18.87617 Max. :17.7612
## BA_Lung BA_Stomach BF_Colon
## Min. : 0.0000 Min. : 0.0000 Min. : 0.0000
## 1st Qu.: 0.1643 1st Qu.: 0.0000 1st Qu.: 0.0000
## Median : 0.4456 Median : 0.0000 Median : 0.1793
## Mean : 1.3873 Mean : 0.7528 Mean : 1.3716
## 3rd Qu.: 1.9392 3rd Qu.: 0.6261 3rd Qu.: 1.9775
## Max. :17.5495 Max. :18.3249 Max. :15.8531
## BF_Stomach OA_Stomach
## Min. : 0.0000 Min. : 0.0000
## 1st Qu.: 0.0000 1st Qu.: 0.0000
## Median : 0.3085 Median : 0.1003
## Mean : 1.5164 Mean : 1.0800
## 3rd Qu.: 2.4221 3rd Qu.: 1.3692
## Max. :15.7256 Max. :17.9600
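The transformation inside `mutate_all()` is easier to check by hand on a single toy sample. This sketch uses three made-up counts; the variable names are for illustration only.

```r
# Made-up counts for one sample
counts <- c(geneA = 1, geneB = 3, geneC = 6)

# Counts per million: rescale so that the sample adds up to 1e6
cpm <- counts * 1e6 / sum(counts)
cpm

# log2 with a pseudocount of 1, so that a count of 0 maps to 0 instead of -Inf
logcpm_toy <- log2(cpm + 1)
logcpm_toy
```

Because every sample is rescaled to the same total, the values become comparable across samples with different sequencing depths, and the pseudocount explains why `Min.` is exactly 0 in every column of the summary above.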
gene_logcpm <- gene_counts %>% # a new dataframe with logcpm values
select(Geneid) %>%
bind_cols(logcpm) %>% # bind the logcpm matrix to the geneids
gather(-Geneid, key = "Sample", value = "Logcpm") %>% # convert to tidy data
separate(Sample, # cleanup complex variables
into = c("Source", "Stage", "Tissue"),
sep = c(1,2),
remove = F) %>% # keep original variable
mutate(Tissue = gsub("^_", "", Tissue),
Stage = ifelse(Stage == "A", "Adult", "Fetus"))
View(head(gene_logcpm, 50))
# Plotting the distribution of gene-expression in each sample
gene_logcpm %>%
ggplot(aes(x = Sample, y = Logcpm, color = Tissue, linetype = Stage)) +
geom_boxplot(outlier.size = 0.2, outlier.alpha = 0.2) +
scale_y_continuous(limits = c(0, 1)) +
coord_flip() +
theme_minimal()
## Warning: Removed 259347 rows containing non-finite values
## (stat_boxplot).
Creating a plot w/ Grammar of Graphics
- Recap and continuation of dplyr
- Basics of plotting data with ggplot2: data, aes, geom
- Customization: Colors, labels, and legends

- ggplot, factor, aes
- geom_bar, geom_histogram
- facet_wrap
- scale_x_log10, labs, coord_flip, theme, theme_minimal

gene_loc %>% # data
ggplot(aes(x = Chr)) + # aesthetics: what to plot?
geom_bar() # geometry: how to plot?
gene_loc$Chr <- factor(gene_loc$Chr,
levels = paste("chr",
c((1:22), "X", "Y", "M"),
sep=""))
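Why bother with the factor? Chromosome names are text, so by default they sort (and plot) alphabetically, which puts chr10 before chr2. A small base-R sketch on four made-up entries shows how explicit levels fix the order:

```r
chr_names <- c("chr10", "chrX", "chr2", "chr1")  # made-up, deliberately shuffled

# Plain character sort: chr10 lands before chr2
sort(chr_names)

# Same levels as in the workshop code above
chr_levels <- paste0("chr", c(1:22, "X", "Y", "M"))
chr_factor <- factor(chr_names, levels = chr_levels)

# Sorting the factor follows the order of the levels instead
as.character(sort(chr_factor))   # "chr1" "chr2" "chr10" "chrX"
```

ggplot2 uses the same level order for discrete axes, which is why the bar chart is re-drawn after this step.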
plot_chr_numgenes <- gene_loc %>%
ggplot(aes(x = Chr)) +
geom_bar()
plot_chr_numgenes
plot_chr_numgenes +
coord_flip() +
theme_minimal()
plot_chr_numgenes +
labs(title = "No. genes per chromosome",
x = "Chromosome",
y = "No. of genes") +
theme_minimal() +
coord_flip()
gene_loc %>%
ggplot(aes(x = Length)) +
geom_histogram(color = "white") +
scale_x_log10() +
theme_minimal()
## `stat_bin()` using `bins = 30`. Pick better value with
## `binwidth`.
plot_chr_genelength <- gene_loc %>%
ggplot(aes(x = Length, fill = Chr)) +
geom_histogram(color = "white") +
scale_x_log10() +
theme_minimal() +
facet_wrap(~Chr, scales = "free_y")
plot_chr_genelength
## `stat_bin()` using `bins = 30`. Pick better value with
## `binwidth`.
plot_chr_genelength +
theme(legend.position = "none") +
labs(x = "Gene length (log-scale)",
y = "No. of genes")
## `stat_bin()` using `bins = 30`. Pick better value with
## `binwidth`.
- geom_point
- geom_abline, geom_vline, geom_hline
- geom_smooth, geom_text_repel

gene_loc %>%
ggplot(aes(x = End-Start, y = Length)) +
geom_point()
plot_strend_length <- gene_loc %>%
ggplot(aes(x = End-Start, y = Length)) +
geom_point(alpha = 0.1, size = 0.5, color = "grey", fill = "grey")
plot_strend_length
plot_strend_length <- plot_strend_length +
scale_x_log10("End-Start") +
scale_y_log10("Gene length") +
theme_minimal()
plot_strend_length
## Warning: Transformation introduced infinite values in
## continuous x-axis
plot_strend_length +
geom_abline(intercept = 0, slope = 1, col = "red") +
geom_hline(yintercept = 500, color = "blue") +
geom_vline(xintercept = 1000, color = "orange")
## Warning: Transformation introduced infinite values in
## continuous x-axis
gene_loc %>%
group_by(Chr) %>%
summarize(meanLength = mean(Length), numGenes = n())
## # A tibble: 25 x 3
## Chr meanLength numGenes
## <fct> <dbl> <int>
## 1 chr1 2258. 5363
## 2 chr2 2304. 4047
## 3 chr3 2382. 3101
## 4 chr4 2109. 2563
## 5 chr5 2188. 2859
## 6 chr6 2124. 2905
## 7 chr7 2164. 2876
## 8 chr8 2036. 2386
## 9 chr9 2123. 2323
## 10 chr10 2160. 2260
## # … with 15 more rows
gene_loc %>%
group_by(Chr) %>%
summarize(meanLength = mean(Length), numGenes = n()) %>%
ggplot(aes(x = numGenes, y = meanLength)) +
geom_point()
# install.packages("ggrepel", dependencies=T)
library(ggrepel)
gene_loc %>%
group_by(Chr) %>%
summarize(meanLength = mean(Length), numGenes = n()) %>%
ggplot(aes(x = numGenes, y = meanLength)) +
geom_point() +
geom_smooth(color = "lightblue", alpha = 0.1) +
labs(x = "No. of genes", y = "Mean gene length") +
geom_text_repel(aes(label = Chr), color="red", segment.color="grey80") +
theme_minimal()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
- geom_boxplot, geom_violin
- scale_y_continuous

gene_logcpm %>%
ggplot(aes(x = Sample, y = Logcpm, color = Tissue, linetype = Stage)) +
geom_boxplot() +
coord_flip() +
theme_minimal()
gene_logcpm %>%
ggplot(aes(x = Sample, y = Logcpm, color = Tissue, linetype = Stage)) +
geom_violin() +
scale_y_continuous(limits = c(0, 0.5)) +
coord_flip() +
theme_minimal()
## Warning: Removed 320317 rows containing non-finite values
## (stat_ydensity).
# Plotting the distribution of gene-expression in each sample
plot_sample_bxp <- gene_logcpm %>%
ggplot(aes(x = Sample, y = Logcpm, color = Tissue, linetype = Stage)) +
geom_boxplot(outlier.size = 0.2, outlier.alpha = 0.2) +
scale_y_continuous(limits = c(0, 1)) +
coord_flip() +
theme_minimal()
plot_sample_bxp
## Warning: Removed 259347 rows containing non-finite values
## (stat_boxplot).
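The "Removed ... rows" warnings appear because limits set through scale_y_continuous() discard out-of-range values before the boxplot or violin statistics are computed. If the goal is only to zoom in, limits set on the coordinate system keep all rows. A minimal sketch on made-up data (toy_logcpm is hypothetical, not the workshop dataset):

```r
library(tidyverse)

# Hypothetical toy data standing in for gene_logcpm (values invented)
toy_logcpm <- tibble(
  Sample = rep(c("S1", "S2"), each = 5),
  Logcpm = c(0.1, 0.4, 0.8, 2.5, 6.0, 0.2, 0.3, 0.9, 3.1, 5.5)
)

# Scale limits: values outside 0-1 are dropped before the statistics are computed
plot_limits <- toy_logcpm %>%
  ggplot(aes(x = Sample, y = Logcpm)) +
  geom_boxplot() +
  scale_y_continuous(limits = c(0, 1)) +
  coord_flip()

# Coordinate limits: only the view is zoomed, all rows still feed the boxplot
# (ylim here refers to the y aesthetic, Logcpm, which is drawn horizontally after the flip)
plot_zoom <- toy_logcpm %>%
  ggplot(aes(x = Sample, y = Logcpm)) +
  geom_boxplot() +
  coord_flip(ylim = c(0, 1))
```

Printing plot_limits triggers the same kind of warning as above, while plot_zoom should not, since no rows are removed.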
# Plotting scatterplot of 2 sets of samples
plot_ffcolon_scatter <- gene_logcpm %>%
filter(Sample == "AF_Colon" | Sample == "BF_Colon") %>%
select(Geneid, Sample, Logcpm) %>%
spread(key = Sample, value = Logcpm) %>%
ggplot(aes(x = AF_Colon, y = BF_Colon)) +
geom_point(alpha = 0.1, size = 0.5) +
geom_smooth(method=lm) +
theme_minimal()
plot_ffcolon_scatter
## `geom_smooth()` using formula 'y ~ x'
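spread() still works but has been superseded in tidyr (1.0.0 and later) by pivot_wider(). The sketch below shows the same long-to-wide reshaping on a small, made-up table (toy_long is hypothetical), in case you prefer the newer function.

```r
library(tidyverse)

# Hypothetical long-format data in the same shape as gene_logcpm (values invented)
toy_long <- tibble(
  Geneid = rep(c("g1", "g2"), times = 2),
  Sample = rep(c("AF_Colon", "BF_Colon"), each = 2),
  Logcpm = c(1.2, 3.4, 1.5, 3.1)
)

# One row per gene, one column per sample
toy_wide <- toy_long %>%
  filter(Sample %in% c("AF_Colon", "BF_Colon")) %>%
  pivot_wider(names_from = Sample, values_from = Logcpm)
```

The resulting toy_wide has the columns Geneid, AF_Colon and BF_Colon, which is the shape the scatterplot above expects.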
# Finding genes with high variance across samples
num_totgenes <- gene_logcpm %>%
distinct(Geneid) %>%
nrow()
highvar_genes <- gene_logcpm %>%
group_by(Geneid) %>%
summarize(iqr = IQR(Logcpm)) %>%
top_n((ceiling(num_totgenes*0.05)), iqr) %>%
pull(Geneid)
length(highvar_genes)
## [1] 2891
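top_n() has been superseded in dplyr (1.0.0 and later) by slice_max(). A minimal sketch of the same "top fraction by IQR" selection on made-up values (toy_iqr is hypothetical, and 20% is used here only so the toy example returns two genes):

```r
library(tidyverse)

# Hypothetical per-gene IQR values (invented for illustration)
toy_iqr <- tibble(
  Geneid = paste0("g", 1:10),
  iqr    = c(0.5, 2.1, 0.3, 1.8, 0.9, 3.2, 0.1, 1.1, 0.7, 2.6)
)

# Keep the top 20% of genes by IQR; slice_max() returns rows sorted from highest to lowest
top_genes <- toy_iqr %>%
  slice_max(iqr, n = ceiling(nrow(toy_iqr) * 0.2)) %>%
  pull(Geneid)

top_genes
```

Like top_n(), slice_max() keeps ties by default, so the result can contain more than n rows when values are tied at the cutoff.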
# Plotting expression of high-var Y chr genes across samples
chry_highvar_genes <- gene_loc %>%
filter(Chr == "chrY" & Geneid %in% highvar_genes) %>%
pull(Geneid)
gene_logcpm %>%
filter(Geneid %in% chry_highvar_genes)
## # A tibble: 238 x 6
## Geneid Sample Source Stage Tissue Logcpm
## <chr> <chr> <chr> <chr> <chr> <dbl>
## 1 ENSG00000129824 AA_Colon A Adult Colon 6.12
## 2 ENSG00000067646 AA_Colon A Adult Colon 4.89
## 3 ENSG00000231535 AA_Colon A Adult Colon 3.12
## 4 ENSG00000099725 AA_Colon A Adult Colon 3.55
## 5 ENSG00000233864 AA_Colon A Adult Colon 3.73
## 6 ENSG00000114374 AA_Colon A Adult Colon 5.67
## 7 ENSG00000067048 AA_Colon A Adult Colon 5.62
## 8 ENSG00000183878 AA_Colon A Adult Colon 6.05
## 9 ENSG00000165246 AA_Colon A Adult Colon 2.31
## 10 ENSG00000176728 AA_Colon A Adult Colon 2.52
## # … with 228 more rows
plot_chry_highvar_boxplot <- gene_logcpm %>%
filter(Geneid %in% chry_highvar_genes) %>%
ggplot(aes(x = reorder(Sample, Logcpm, FUN = median),
y = Logcpm,
color = Sample)) +
geom_boxplot() +
coord_flip() +
theme_minimal() +
theme(legend.position = "none")
plot_chry_highvar_boxplot
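The reorder() call in the plot above sorts the samples by their median Logcpm so that the boxplots appear in order. A small base-R sketch with made-up values shows what it does to the factor levels:

```r
# Hypothetical sample labels and expression values (invented for illustration)
sample_id <- c("a", "a", "b", "b")
logcpm    <- c(5, 7, 1, 3)

# Levels are ordered by the per-sample median, lowest first
sample_ordered <- reorder(sample_id, logcpm, FUN = median)
levels(sample_ordered)
```

Here sample "b" (median 2) comes before sample "a" (median 6), which is the order ggplot then uses along the axis.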
RMarkdown
Save a ggplot (or other grid object) with sensible defaults
library(tidyverse)
library(here) # here() is not attached by library(tidyverse)
# Save your file name
plot1 <- "my_plot1.tiff"
# Save your absolute/relative path
my_full_path <- here("data")
# To save the plot as a TIFF file ...
ggsave(filename=plot1,
plot=static_plot, # your final ggplot object
device="tiff",
path=my_full_path,
dpi=600)
Write a data frame to a delimited file
library(tidyverse)
library(here) # here() is not attached by library(tidyverse)
# Save your file name
filename <- "my_new_data.txt"
# Save your absolute/relative path
my_full_path <- here("data")
# Build the full output path (directory + file name)
my_path <- file.path(my_full_path, filename)
# To save as a tab-delimited text file ...
write_tsv(x=my_newly_formatted_data, # your final reformatted dataset
file=my_path, # Absolute path recommended.
# However, you can directly use 'filename' here
# if you are saving the file in the same directory
# as your code.
# ('file' replaces the deprecated 'path' argument as of readr 1.4.0)
col_names=TRUE) # if you want the column names to be
# saved in the first row, recommended
# Alternatively, you could save it as a comma-separated text file
write_csv(x=my_newly_formatted_data,
file=my_path,
col_names=TRUE)
# Or save it with any other delimiter
# choose wisely, pick a delim that's not part of your dataframe
write_delim(x=my_newly_formatted_data,
file=my_path,
col_names=TRUE,
delim="---")
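One quick way to check an export is to read the file back and compare it with the original. The sketch below does this with a small, made-up data frame written to a temporary file (toy_data and tmp_file are hypothetical names, and the file argument assumes readr 1.4.0 or later).

```r
library(tidyverse)

# Hypothetical data frame to export (values invented for illustration)
toy_data <- tibble(
  Geneid = c("g1", "g2", "g3"),
  Logcpm = c(1.5, 2.25, 0.75)
)

# Write to a temporary file so nothing in your project is overwritten
tmp_file <- tempfile(fileext = ".txt")
write_tsv(x = toy_data, file = tmp_file, col_names = TRUE)

# Read it back with explicit column types and compare with the original
toy_back <- read_tsv(
  tmp_file,
  col_types = cols(Geneid = col_character(), Logcpm = col_double())
)
all.equal(toy_data$Logcpm, toy_back$Logcpm)
```

If the comparison is not TRUE, the delimiter or column types chosen for reading probably do not match how the file was written.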
| Option | Description |
|---|---|
| Part 1 | Getting Started |
| install.packages | Download and install packages from CRAN-like repositories or from local files |
| library | library and require load and attach add-on packages |
| package::function | To run a function w/o loading the package |
| Import | tidyverse > readr & readxl |
| read_delim | Read a delimited file (incl csv, tsv) into a tibble |
| read_csv, read_tsv | read_csv() and read_tsv() are special cases of the general read_delim() |
| read_excel | Read xls and xlsx files |
| here | Part of here package. To set paths relative to your current project directory |
| Data snapshot | |
| str | Compactly Display the Structure of an Arbitrary R Object |
| head | Return the First or Last Part of an Object |
| glimpse | Get a glimpse of your data |
| View | Invoke a Data Viewer |
| kable | Create tables in LaTeX, HTML, Markdown and reStructuredText |
| paged_table | Create a table in HTML with support for paging rows and columns |
| Part 2 | tidyverse > tidyr |
| gather | Gather Columns Into Key-Value Pairs (COLS -> ROWS) |
| spread | Spread a key-value pair across multiple columns |
| separate | Separate one column into multiple columns |
| unite | Unite multiple columns into one |
| Part 3 | tidyverse > dplyr |
| filter | Return rows with matching conditions |
| select | Select/rename variables by name |
| mutate | Add new variables |
| transmute | Add new variables & drop existing variables |
| distinct | Get unique entries |
| arrange | Arrange rows by variables |
| summarise | Reduces multiple values down to a single value |
| group_by | Group by one or more variables |
| join | Join two tbls together: left_join, right_join, inner_join |
| bind | Efficiently bind multiple data frames by row and column: bind_rows, bind_cols |
| setops | Set operations: intersect, union, setdiff, setequal |
| Part 4 | tidyverse > ggplot2 |
| ggplot | Create a new ggplot |
| Aesthetics | |
| aes | Specify details on what to plot |
| factor | Specify a variable to be ordered & categorical |
| facet_wrap | Split into multiple sub-plots based on a categorical variable |
| Geometries | |
| geom_bar | Create a barplot |
| geom_histogram | Create a histogram |
| geom_point | Create a scatter plot |
| geom_boxplot | Create a boxplot |
| geom_violin | Create a violin plot |
| geom_abline, geom_hline, geom_vline | Add slanted, horizontal, or vertical lines |
| geom_smooth | Add a smooth curve that fits the data points |
| Plot customization | |
| theme, theme_minimal, theme_bw | Adjust themes to minimal, black/white, etc |
| scale_x_log10 | Change x/y axes to log scale |
| scale_y_continuous | Change x/y axes to continuous & set limits |
| coord_flip | Flip x & y coordinates |
| labs | Specify axes labels and plot titles |
| Part 5 | Export & Wrap-up |
| Export | tidyverse > ggplot2 & readr |
| ggsave | Save a ggplot (or other grid object) with sensible defaults |
| write_delim | Write a data frame to a delimited file |
| write_tsv | write_delim customized for tab-separated values |
| write_csv | write_delim customized for comma-separated values |

Arjun Krishnan and I co-developed the content for this workshop.